ABSTRACT

Many studies have been conducted on seeking the efficient solution 
for subgraph similarity search over certain (deterministic) graphs 
due to its wide application in many fields, including bioinformat- 
ics, social network analysis, and Resource Description Framework 
(RDF) data management. All these works assume that the underly- 
ing data are certain. However, in reality, graphs are often noisy 
and uncertain due to various factors, such as errors in data ex- 
traction, inconsistencies in data integration, and privacy preserving 
purposes. Therefore, in this paper, we study subgraph similarity 
search on large probabilistic graph databases. Different from pre- 
vious works assuming that edges in an uncertain graph are inde- 
pendent of each other, we study the uncertain graphs where edges' 
occurrences are correlated. We formally prove that subgraph similarity
search over probabilistic graphs is #P-complete; thus, we
employ a filter-and-verify framework to speed up the search. In the
filtering phase, we develop tight lower and upper bounds of sub- 
graph similarity probability based on a probabilistic matrix index, 
PMI. PMI is composed of discriminative subgraph features associ- 
ated with tight lower and upper bounds of subgraph isomorphism 
probability. Based on PMI, we can sort out a large number of prob- 
abilistic graphs and maximize the pruning capability. During the 
verification phase, we develop an efficient sampling algorithm to 
validate the remaining candidates. The efficiency of our proposed 
solutions has been verified through extensive experiments. 

I. INTRODUCTION 

Graphs have been used to model various data in a wide range 
of applications, such as bioinformatics, social network analysis, 
and RDF data management. Furthermore, in these real applica- 
tions, due to noisy measurements, inference models, ambiguities of 
data integration, and privacy-preserving mechanisms, uncertainties 
are often introduced in the graph data. For example, in a protein- 
protein interaction (PPI) network, the pairwise interaction is de- 
rived from statistical models [5, 6, 20], and the STRING database 
(http://string-db.org) is such a public data source that contains PPIs 
with uncertain edges provided by statistical predictions. In a social
network, probabilities can be assigned to edges to model the
degree of influence or trust between two social entities [2, 25, 14].
In an RDF graph, uncertainties and inconsistencies are introduced in
data integration where various data sources are integrated into RDF 
graphs [18, 24]. To model the uncertain graph data, a probabilistic 
graph model is introduced [27, 43, 21, 18, 24]. In this model, each 
edge is associated with an edge existence probability to quantify the 
likelihood that this edge exists in the graph, and edge probabilities 
are independent of each other. However, this independence assumption
does not hold in many real scenarios. For example,
for uncertain protein-protein interaction (PPI) networks, authors in 
[9, 28] first establish elementary interactions with probabilities be- 
tween proteins, then use machine learning tools to predict other 
possible interactions based on the elementary links. The predic- 
tive results show that interactions are correlated, especially with 
high dependence among interactions at the same protein. As another
example, in communication networks or road networks, an
edge probability is used to quantify the reliability of a link [8] or the
degree of traffic congestion [16]. Clearly, the routing paths in these
networks are correlated [16]; e.g., a congested path often
blocks traffic on nearby paths. Therefore, a
probabilistic graph model needs to consider the correlations that exist among
edges or nodes.

Clearly, it is unrealistic to model the joint distribution for the entire
set of nodes in a large graph, e.g., a road or social network.
Thus, in this paper, we introduce joint distributions for local nodes.
For example, in graph 001 of Figure 1, we give a joint distribution
to measure the interactions (neighbor edges^1) of the 3 nodes in a local
neighborhood. The joint probability table (JPT) shows the joint
distribution, and a probability in JPT (the second row) is given as
$Pr(e_1 = 1, e_2 = 1, e_3 = 0) = 0.2$, where "1" denotes existence
while "0" denotes nonexistence. For larger graphs, we have
multiple joint distributions of nodes in small neighborhoods (in 
fact, these are marginal distributions). In real applications, these 
marginal distributions can be easily obtained. For example, authors 
in [16] use sampling methods to estimate a traffic joint probability 
of nearby roads, and point out that the traffic joint probability fol- 
lows a multi-gaussian distribution. For PPI networks, authors in [9, 
28] establish marginal distributions using a Bayesian prediction. 

In this paper, we study subgraph similarity search over proba- 
bilistic graphs due to wide usage of subgraph similarity search in 
many application fields, such as answering SPARQL query (graph) 
in RDF graph data [18, 1], predicting complex biological interac- 
tions (graphs) [33, 9], and identifying vehicle routings (graphs) in 
road networks [8, 16]. In the following, we give the details about 
subgraph similarity search, our solutions and contributions. 



1 Neighbor edges are the edges that are incident to the same vertex or the
edges of a triangle.

Figure 1: Probabilistic graph database & query graph

1.1 Probabilistic Subgraph Matching 

In this paper, we focus on threshold-based probabilistic subgraph
similarity matching (T-PS) over a large set of probabilistic
graphs. Specifically, let $D = \{g_1, g_2, \ldots, g_n\}$ be a set of probabilistic
graphs where edges' existences are not independent, but are
given explicitly by joint distributions, $q$ be a query graph, and $\epsilon$ be
a probability threshold. A T-PS query retrieves all graphs $g \in D$
such that the subgraph similarity probability (SSP) between $q$ and
$g$ is at least $\epsilon$. We will formally define SSP later (Def 9). We employ
the possible world semantics [31, 11], which has been widely
used for modeling probabilistic databases, to explain the meaning
of returned results for subgraph similarity search. A possible world
graph (PWG) of a probabilistic graph is a possible instance of the
probabilistic graph. It contains all vertices and a subset of edges
of the probabilistic graph, and it has a weight which is obtained
by joining joint probability tables of all neighbor edges. Then, for
a query graph $q$ and a probabilistic graph $g$, the probability that $q$
subgraph similarly matches $g$ is the summation of the weights of
those PWGs of $g$ to which $q$ is subgraph similar. If $q$ is subgraph
similar to a PWG $g'$, then $g'$ must contain a subgraph of $q$, say $q'$,
such that the difference between $q$ and $q'$ is at most the user-specified
error tolerance threshold $\delta$. In other words, $q$ is subgraph
isomorphic to $g'$ after $q$ is relaxed with $\delta$ edges.

Example 1. Consider graph 002 in Figure 1. $JPT_1$ and $JPT_2$
give joint distributions of neighbor edges $\{e_1, e_2, e_3\}$ and $\{e_3, e_4, e_5\}$,
respectively. Figure 2 lists partial PWGs of probabilistic graph
002 and their weights. The weight of PWG (1) is obtained by joining
$t_1$ of $JPT_1$ and $t_2$ of $JPT_2$, i.e.,

$$Pr(e_1 = 1, e_2 = 1, e_3 = 1, e_4 = 1, e_5 = 0) = Pr(e_1 = 1, e_2 = 1, e_3 = 1) \times Pr(e_3 = 1, e_4 = 1, e_5 = 0) = 0.3 \times 0.25 = 0.075.$$

Suppose the distance threshold is 1. To decide if $q$ subgraph similarly
matches probabilistic graph 002, we first find all of 002's PWGs that contain a
subgraph whose difference from $q$ is at most 1. The results
are PWGs (1), (2), (3) and (4), as shown in Figure 2, since we can
delete edge a, b or c of $q$. Next, we add up the probabilities of these
PWGs: 0.075 + 0.045 + 0.075 + 0.045 + ... = 0.45. If the query
specifies a probability threshold of 0.4, then graph 002 is returned
since 0.45 > 0.4.
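
The weight computation in Example 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pwg_weight` and the dictionary encoding of a joint probability table are hypothetical, and only the two JPT rows quoted in Example 1 are filled in (the remaining rows of Figure 1 are not reproduced here).

```python
from math import prod

# Hypothetical encoding: each JPT is (edge names, {assignment tuple: probability}).
# Only the two rows quoted in Example 1 are included.
JPT1 = (("e1", "e2", "e3"), {(1, 1, 1): 0.3})
JPT2 = (("e3", "e4", "e5"), {(1, 1, 0): 0.25})


def pwg_weight(world, jpts):
    """Weight of a possible world: product of the matching row of every JPT.

    `world` maps each edge name to 1 (present) or 0 (absent).
    Rows missing from a table raise KeyError instead of being guessed.
    """
    return prod(table[tuple(world[e] for e in edges)] for edges, table in jpts)


# PWG (1) of Example 1: e1, e2, e3, e4 present and e5 absent.
pwg1 = {"e1": 1, "e2": 1, "e3": 1, "e4": 1, "e5": 0}
weight = pwg_weight(pwg1, [JPT1, JPT2])
print(round(weight, 3))
```

A naive T-PS evaluation would call such a function for every PWG to which $q$ is subgraph similar and compare the sum with the probability threshold, which is exactly the exponential enumeration the filter-and-verify method avoids.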

The above example gives a naive solution, to T-PS query process- 
ing, that needs to enumerate all PWGs of a probabilistic graph. This 
solution is very inefficient due to the exponential number of PWGs. 
Therefore, in this paper, we propose a filter-and-verify method to 
reduce the search space. 

1.2 Overview of Our Approach 

Given a set of probabilistic graphs $D = \{g_1, \ldots, g_n\}$ and a query
graph $q$, our solution performs T-PS query processing in three steps,
namely, structural pruning, probabilistic pruning, and verification.



Figure 2: Partial possible world graphs of probabilistic graph 002

Structural Pruning 

The idea of structural pruning is straightforward. If we remove 
all the uncertainty in a probabilistic graph, and q is still not sub- 
graph similar to the resulting graph, then q cannot subgraph simi- 
larly match the original probabilistic graph. 

Formally, for $g \in D$, let $g^c$ denote the corresponding deterministic
graph after we remove all the uncertain information from $g$.
We have

Theorem 1. If $q \not\subseteq_{sim} g^c$, then $Pr(q \subseteq_{sim} g) = 0$,

where $\subseteq_{sim}$ denotes the subgraph similar relationship (Def 8), and
$Pr(q \subseteq_{sim} g)$ denotes the subgraph similarity probability of $q$
to $g$.

Based on this observation, given $D$ and $q$, we can prune the
database $D^c = \{g_1^c, \ldots, g_n^c\}$ using conventional deterministic graph
similarity matching methods. In this paper, we adopt the method in
[38] to quickly compute results. [38] uses a multi-filter composition
strategy to prune a large number of graphs directly without performing
pairwise similarity computation, which makes [38] more
efficient than other graph similarity search algorithms [15,
41]. Assume the result is $SC_q^c = \{g^c \mid q \subseteq_{sim} g^c, g^c \in D^c\}$. Then,
its corresponding probabilistic graph set, $SC_q = \{g \mid g^c \in SC_q^c\}$, is
the input for uncertain subgraph similarity matching in the next step.

Probabilistic Pruning 

To further prune the results, we propose a Probabilistic Matrix Index
(PMI), which will be introduced later, for probabilistic pruning.
For a given set of probabilistic graphs $D$ and its corresponding set
of deterministic graphs $D^c$, we create a feature set $F$ from $D^c$,
where each feature is a deterministic graph, i.e., $F \subset D^c$. In PMI,
for each $g \in SC_q$, we can locate a set

$$D_g = \{\{LowerB(f_j), UpperB(f_j)\} \mid f_j \subseteq_{iso} g^c, 1 \le j \le |F|\}$$

where $LowerB(f)$ and $UpperB(f)$ are the lower and upper bounds
of the subgraph isomorphism probability of $f$ to $g$ (Def 6), denoted
by $Pr(f \subseteq_{iso} g)$. In this paper, $\subseteq_{iso}$ is used to denote subgraph
isomorphism. If $f$ is not subgraph isomorphic to $g^c$, we have (0).

In the probabilistic filtering, we first determine the remaining
graphs after $q$ is relaxed with $\delta$ edges, where $\delta$ is the subgraph distance
threshold. Suppose the remaining graphs are $\{rq_1, \ldots, rq_i, \ldots,
rq_a\}$. For each $rq_i$, we find two features $f_i^1$ and $f_i^2$ in $D_g$ such
that $rq_i \supseteq_{iso} f_i^1$ and $rq_i \subseteq_{iso} f_i^2$. Let $Pr(q \subseteq_{sim} g)$ denote the
subgraph similarity probability of $q$ to $g$ (Def 9). Then, we can calculate
upper and lower bounds of $Pr(q \subseteq_{sim} g)$ based on the values
of $UpperB(f_i^1)$ and $LowerB(f_i^2)$ for $1 \le i \le a$, respectively.
If the upper bound of $Pr(q \subseteq_{sim} g)$ is smaller than the probability
threshold $\epsilon$, $g$ is pruned. If the lower bound of $Pr(q \subseteq_{sim} g)$ is
not smaller than $\epsilon$, $g$ is in the final answers.

Verification

In this step, we calculate $Pr(q \subseteq_{sim} g)$ for query $q$ and each candidate
answer $g$ remaining after probabilistic pruning, to make sure $g$ is really an
answer, i.e., $Pr(q \subseteq_{sim} g) \ge \epsilon$.

1.3 Contributions and Paper Organization 

The main idea (contribution) of our approach is to use the proba- 
bilistic checking of feature-based index to discriminate most graphs. 
To achieve this, several challenges need to be addressed. 

Challenge 1: Determine best bounds of $Pr(q \subseteq_{sim} g)$

For each $rq_i$, we can find many $f_i^1$s and $f_i^2$s in PMI; thus, a large
number of bounds of $Pr(q \subseteq_{sim} g)$ based on the combination of
$UpperB(f_i^1)$ and $LowerB(f_i^2)$ for $1 \le i \le a$ can be computed.
In this paper, we convert the problem of computing the best upper
bound into a set cover problem, and we formalize the problem of
computing the best lower bound as an integer quadratic program. Our
contribution is to develop an efficient randomized algorithm to obtain
the best lower bound, which is presented in Section 3.

Challenge 2: Compute an effective $D_g$

An effective $D_g$ should consist of tight $UpperB(f)$ and $LowerB(f)$
whose values can be computed efficiently. As we will show
later, calculating $Pr(f \subseteq_{iso} g)$ is #P-complete, which increases
the difficulty of computing an effective $D_g$. To address this
challenge, we derive tight $UpperB(f)$ and
$LowerB(f)$ by converting the problem of computing bounds into
a maximum clique problem, and propose an efficient solution by
combining the properties of conditional independence
and graph theory, which is discussed in Section 4.1.

Challenge 3: Find the features that maximize pruning 

Frequent subgraphs (mined from $D^c$) are commonly used as features
in graph matching. However, it would be impractical to index
all of them. Our goal is to maximize the pruning capability with 
a small number of features. To achieve this goal, we consider two 
criteria in selecting features, the size of the feature and the num- 
ber of disjoint embeddings that a feature has. A feature of small 
size and many embeddings is preferred. The details about feature 
selection are given in Section 4.2. 

Challenge 4: Compute SSP efficiently 

Though we are able to filter out a large number of probabilistic 
graphs, computing the exact SSP in the verification phase may still 
take quite some time and become the bottleneck in query process- 
ing. To address this issue, we develop an efficient sampling algo- 
rithm, based on the Monte Carlo theory, to estimate SSP with a 
high quality, which is presented in Section 5. 
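
A generic Monte Carlo estimator conveys the idea behind this verification step. The sketch below is an illustration under stated assumptions rather than the algorithm of Section 5: `sample_world` and `is_similar` are hypothetical callables standing in for a possible-world sampler and a deterministic subgraph similarity test, and the toy data at the end is invented for demonstration.

```python
import random


def estimate_ssp(sample_world, is_similar, n_samples=1000, seed=0):
    """Estimate SSP as the fraction of sampled possible worlds that pass
    the deterministic similarity test.

    sample_world(rng) -> a possible world (hypothetical sampler)
    is_similar(world) -> bool (hypothetical deterministic check)
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if is_similar(sample_world(rng)))
    return hits / n_samples


# Toy stand-ins: a "world" is one edge that exists with probability 0.45,
# and the query matches whenever that edge exists.
def toy_sampler(rng):
    return rng.random() < 0.45


estimate = estimate_ssp(toy_sampler, lambda world: world, n_samples=20000)
print(estimate)
```

The estimate concentrates around the true probability as the number of samples grows; for correlated edges, the sampler would have to draw worlds consistently with the joint probability tables.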

In addition, in Section 2, we formally define T-PS queries over
probabilistic graphs and give the complexity of the problem.
We discuss the results of performance tests on real data sets
in Section 6 and the related works in Section 7. We conclude our
work in Section 8.

2. PROBLEM DEFINITION 

In this section, we define some necessary concepts and show the 
complexity of our problem. Table 1 summarizes the notations used 
in this paper. 

2.1 Problem Definition 

Definition 1. (Deterministic Graph) An undirected deterministic
graph^2 $g^c$ is denoted as $(V, E, \Sigma, L)$, where $V$ is a set of vertices,
$E$ is a set of edges, $\Sigma$ is a set of labels, and $L : V \cup E \to \Sigma$ is

2 In this paper, we consider undirected graphs, although it is straightforward
to extend our methods to directed graphs.



Symbol                          Description
------------------------------  ----------------------------------------------------
$D$, $SC_q$, $A_q$              the probabilistic database set
$D^c$, $SC_q^c$                 the deterministic database
$g$                             the probabilistic graph
$\epsilon$                      the user-specified probability threshold
$\delta$                        the subgraph distance threshold
$f$, $q$, $g'$, $g^c$           the deterministic graph
$U = \{rq_1, \ldots, rq_a\}$    the remaining graph set after $q$ is relaxed with $\delta$ edges
$LowerB(f)$, $UpperB(f)$        the lower and upper bounds of SIP
$L_{sim}(q)$, $U_{sim}(q)$      the lower and upper bounds of SSP
$Brq$, $Bf$, $Bc$               the Boolean variables of query, embedding and cut
$Ef$, $Ec$                      the set of embeddings and cuts
$IN$                            the set of disjoint embeddings
$F$                             the feature set
$Pr(x_{ne})$                    the joint probability distribution of neighbor edges
$Pr(q \subseteq_{iso} g)$       the subgraph isomorphism probability between $q$ and $g$
$Pr(q \subseteq_{sim} g)$       the subgraph similarity probability between $q$ and $g$

Table 1: Notations



a function that assigns labels to vertices and edges. A set of edges
are neighbor edges, denoted by $ne$, if they are incident to the same
vertex or the edges form a triangle in $g^c$.

For example, consider graph 001 in Figure 1. Edges $e_1$, $e_2$, and
$e_3$ are neighbor edges, since they form a triangle. Consider graph
002 in Figure 1. Edges $e_3$, $e_4$, and $e_5$ are also neighbor edges, since
they are incident to the same vertex.

Definition 2. (Probabilistic Graph) A probabilistic graph is defined
as $g = (g^c, X_E)$, where $g^c$ is a deterministic graph, and $X_E$
is a binary random variable set indexed by $E$. An element $x_e \in X_E$
takes values 0 and 1, and denotes the existence possibility of edge
$e$. A joint probability density function $Pr(x_{ne})$ is assigned to each
neighbor edge set, where $x_{ne}$ denotes the assignments restricted to
the random variables of a neighbor edge set, $ne$.

A probabilistic graph has uncertain edges but deterministic vertices.
The probability function $Pr(x_{ne})$ is given as a joint probability
table of the random variables of $ne$. For example, the probabilistic
graph 002 in Figure 1 has 2 joint probability tables associated with
2 neighbor edge sets, respectively.

Definition 3. (Possible World Graph) A possible world graph
$g' = (V', E', \Sigma', L')$ is an instantiation of a probabilistic graph
$g = ((V, E, \Sigma, L), X_E)$, where $V' = V$, $E' \subseteq E$, $\Sigma' \subseteq \Sigma$. We
denote the instantiation from $g$ to $g'$ as $g \Rightarrow g'$.

Both $g'$ and $g^c$ are deterministic graphs. But a probabilistic
graph $g$ corresponds to one $g^c$ and multiple possible world graphs.
We use $PWG(g)$ to denote the set of all possible world graphs derived
from $g$. For example, Figure 2 lists 4 possible world graphs
of the probabilistic graph 002 in Figure 1.

Definition 4. (Conditional Independence) Let $X$, $Y$, and $Z$
be sets of random variables. $X$ is conditionally independent of $Y$
given $Z$ (denoted by $X \perp Y \mid Z$) in distribution $Pr$ if:

$$Pr(X = x, Y = y \mid Z = z) = Pr(X = x \mid Z = z) \, Pr(Y = y \mid Z = z)$$

for all values $x \in dom(X)$, $y \in dom(Y)$ and $z \in dom(Z)$.

Following real applications [9, 28, 18, 16], we assume that any
two disjoint subsets of Boolean variables, $X_A$ and $X_B$ of $X_E$, are
conditionally independent given a subset $X_C$ ($X_A \perp X_B \mid X_C$), if
there is a path from a vertex in $A$ to a vertex in $B$ passing through
$C$. Then, the probability of a possible world graph $g'$ is given by:

$$Pr(g \Rightarrow g') = \prod_{ne \in NS} Pr(x_{ne}) \qquad (1)$$

where $NS$ is the collection of all the sets of neighbor edges of $g$.

For example, in probabilistic graph 002 of Figure 1, $\{e_1, e_2\} \perp
\{e_4, e_5\} \mid e_3$. Clearly, for any possible world graph $g'$, we have
$Pr(g \Rightarrow g') > 0$ and $\sum_{g' \in PWG(g)} Pr(g \Rightarrow g') = 1$, that is, each
possible world graph has an existence probability, and the sum of
these probabilities is 1.

Definition 5. (Subgraph Isomorphism) Given two deterministic
graphs $g_1 = (V_1, E_1, \Sigma_1, L_1)$ and $g_2 = (V_2, E_2, \Sigma_2, L_2)$, we
say $g_1$ is subgraph isomorphic to $g_2$ (denoted by $g_1 \subseteq_{iso} g_2$), if
and only if there is an injective function $f : V_1 \to V_2$ such that:

• for any $(u, v) \in E_1$, there is an edge $(f(u), f(v)) \in E_2$;

• for any $u \in V_1$, $L_1(u) = L_2(f(u))$;

• for any $(u, v) \in E_1$, $L_1(u, v) = L_2(f(u), f(v))$.

The subgraph $(V_3, E_3)$ of $g_2$ with $V_3 = \{f(v) \mid v \in V_1\}$ and $E_3 =
\{(f(u), f(v)) \mid (u, v) \in E_1\}$ is called the embedding of $g_1$ in $g_2$.

When $g_1$ is subgraph isomorphic to $g_2$, we also say that $g_1$ is a
subgraph of $g_2$ and $g_2$ is a super-graph of $g_1$.

Definition 6. (Subgraph Isomorphism Probability) For a deterministic
graph $f$ and a probabilistic graph $g$, we define their
subgraph isomorphism probability (SIP) as

$$Pr(f \subseteq_{iso} g) = \sum_{g' \in SUB(f, g)} Pr(g \Rightarrow g') \qquad (2)$$

where $SUB(f, g)$ is the set of $g$'s possible world graphs that are super-graphs
of $f$, that is, $SUB(f, g) = \{g' \in PWG(g) \mid f \subseteq_{iso} g'\}$.

Definition 7. (Maximum Common Subgraph-MCS) Given two
deterministic graphs $g_1$ and $g_2$, the maximum common subgraph of
$g_1$ and $g_2$ is the largest subgraph of $g_2$ that is subgraph isomorphic
to $g_1$, denoted by $mcs(g_1, g_2)$.

Definition 8. (Subgraph Distance) Given two deterministic graphs
$g_1$ and $g_2$, the subgraph distance is $dis(g_1, g_2) = |g_1| -
|mcs(g_1, g_2)|$. Here, $|g_1|$ and $|mcs(g_1, g_2)|$ denote the number of
edges in $g_1$ and $mcs(g_1, g_2)$, respectively. For a distance threshold
$\delta$, if $dis(g_1, g_2) \le \delta$, we call $g_1$ subgraph similar to $g_2$.

Note that, in this definition, subgraph distance only depends on 
the edge set difference, which is consistent with previous works on
similarity search over deterministic graphs [38, 15, 30]. The oper- 
ations on an edge consist of edge deletion, relabeling and insertion. 

Definition 9. (Subgraph Similarity Probability) For a given
query graph $q$, a probabilistic graph $g$^3, and a subgraph distance
threshold $\delta$, we define their subgraph similarity probability as

$$Pr(q \subseteq_{sim} g) = \sum_{g' \in SIM(q, g)} Pr(g \Rightarrow g') \qquad (3)$$

where $SIM(q, g)$ is the set of $g$'s possible world graphs that have subgraph
distance to $q$ no larger than $\delta$, that is, $SIM(q, g) = \{g' \in PWG(g)
\mid dis(q, g') \le \delta\}$.

3 Without loss of generality, in this paper, we assume the query graph is a
connected deterministic graph, and the probabilistic graph is connected.

Figure 3: The probabilistic graph g and query graph q constructed for

Figure 4: Probabilistic Matrix Index (PMI) & features of probabilistic
graph database

Problem Statement. Given a set of probabilistic graphs $D =
\{g_1, \ldots, g_n\}$, a query graph $q$, and a probability threshold $\epsilon$ ($0 <
\epsilon \le 1$), a subgraph similar query returns a set of probabilistic
graphs $\{g \mid Pr(q \subseteq_{sim} g) \ge \epsilon, g \in D\}$.

2.2 Problem Complexity 

From the problem statement, we know that in order to answer 
probabilistic subgraph similar queries efficiently, we need to cal- 
culate SSP (subgraph similarity probability) efficiently. We now 
show the time complexity of calculating SSP. 

Theorem 2. It is #P-complete to calculate the subgraph similarity
probability.

Proof. Due to space limits, we do not give the full proof and just
highlight the major steps here. We consider a probabilistic graph
whose edge probabilities are independent of each other. This
probabilistic graph model is a special case of the probabilistic graph
defined in Definition 2. We prove the theorem by reducing, in polynomial
time, an arbitrary instance of the #P-complete DNF counting problem [13] to
an instance of the problem of computing $Pr(q \subseteq_{sim} g)$.
Figure 3 illustrates a reduction for the DNF formula
$F = (y_1 \wedge y_2) \vee (y_1 \wedge y_2 \wedge y_3) \vee (y_2 \wedge y_3)$. In the figure, the graph
distance between $q$ and each possible world graph $g'$ is 1 (delete
vertex $w$ from $g$). Each truth assignment to the variables in $F$ corresponds
to a possible world graph $g'$ derived from $g$. The probability
of each truth assignment equals the probability of the $g'$ that
the truth assignment corresponds to. A truth assignment satisfies $F$
if and only if the $g'$ that the truth assignment corresponds to is subgraph
similar to $q$ (suppose the graph distance is 1). Thus, $Pr(F)$ is equal to
the probability $Pr(q \subseteq_{sim} g)$. ■

3. PROBABILISTIC PRUNING 

As mentioned in Section 1.2, we first conduct structural pruning
to remove probabilistic graphs that do not approximately contain
the query graph $q$, and then we use probabilistic pruning techniques
to further filter the remaining probabilistic graph set, named $SC_q$.

3.1 Pruning Conditions

We first introduce an index structure, Probabilistic Matrix Index
(PMI), to facilitate probabilistic filtering. Each column of the
matrix corresponds to a probabilistic graph in the database $D$, and
each row corresponds to an indexed feature. Each entry records
$\{LowerB(f), UpperB(f)\}$, where $UpperB(f)$ and $LowerB(f)$
are the upper and lower bounds of the subgraph isomorphism probability
of $f$ to $g$, respectively.

Example 2. Figure 4 shows the PMI of probabilistic graphs in
Figure 1.

Given a query $q$, a probabilistic graph $g$ and subgraph distance $\delta$,
we generate a graph set, $U = \{rq_1, \ldots, rq_a\}$, by relaxing $q$ with $\delta$
edge deletions or relabelings^4. Here, we use the solution proposed
in [38] to generate $\{rq_1, \ldots, rq_a\}$. Suppose we have built the PMI.
For each $g \in SC_q$, in PMI, we locate

$$D_g = \{\{LowerB(f_j), UpperB(f_j)\} \mid f_j \subseteq_{iso} g^c, 1 \le j \le |F|\}$$

For each $rq_i$, we find two graph features in $D_g$, $\{f_i^1, f_i^2\}$, such that
$rq_i \supseteq_{iso} f_i^1$ and $rq_i \subseteq_{iso} f_i^2$, where $1 \le i \le a$. Then we have
probabilistic pruning conditions as follows.

Pruning 1. (subgraph pruning) Given a probability threshold $\epsilon$
and $D_g$, if $\sum_{i=1}^{a} UpperB(f_i^1) < \epsilon$, then $g$ can be safely pruned
from $SC_q$.

Pruning 2. (super graph pruning) Given a probability threshold $\epsilon$
and $D_g$, if $\sum_{i=1}^{a} LowerB(f_i^2) - \sum_{1 \le i, j \le a} UpperB(f_i^2) \, UpperB(f_j^2) \ge \epsilon$,
then $g$ is in the final answers, i.e., $g \in A_q$, where $A_q$
is the final answer set.
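
The two pruning conditions amount to a three-way decision per candidate graph, which can be sketched as follows. This is a hedged illustration: the function name `pruning_decision` and the list-based inputs are hypothetical, and the double sum in Pruning 2 is read literally as ranging over all index pairs $(i, j)$ with $1 \le i, j \le a$, so that it equals the square of the summed upper bounds. If the intended range were $i < j$, the subtracted term would be smaller and the bound tighter.

```python
def pruning_decision(upper_f1, lower_f2, upper_f2, eps):
    """Apply Pruning 1 and Pruning 2 to one candidate graph g.

    upper_f1[i] = UpperB(f_i^1), lower_f2[i] = LowerB(f_i^2),
    upper_f2[i] = UpperB(f_i^2) for the relaxed queries rq_1..rq_a.
    Returns "prune", "answer", or "verify".
    """
    u_sim = sum(upper_f1)  # upper bound used by Pruning 1
    # Literal reading of the double sum over all pairs (i, j):
    # sum_i sum_j U_i * U_j == (sum_i U_i) ** 2.
    l_sim = sum(lower_f2) - sum(upper_f2) ** 2  # lower bound used by Pruning 2
    if u_sim < eps:
        return "prune"
    if l_sim >= eps:
        return "answer"
    return "verify"  # bounds are inconclusive; fall through to verification


# Invented numbers, for illustration only.
print(pruning_decision([0.1, 0.1], [0.0, 0.0], [0.0, 0.0], eps=0.4))
print(pruning_decision([0.5, 0.4], [0.5, 0.3], [0.5, 0.3], eps=0.1))
print(pruning_decision([0.5, 0.4], [0.2, 0.1], [0.3, 0.2], eps=0.4))
```

Only candidates that return "verify" need the expensive verification step.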

Before proving the correctness of the above two pruning conditions,
we first introduce a lemma about $Pr(q \subseteq_{sim} g)$, which
will be used for the proof. Let $Brq_i$ be a Boolean variable where
$1 \le i \le a$; $Brq_i$ is true when $rq_i$ is subgraph isomorphic to $g^c$,
and $Pr(Brq_i)$ is the probability that $Brq_i$ is true. We have

Lemma 1.

$$Pr(q \subseteq_{sim} g) = Pr(Brq_1 \vee \ldots \vee Brq_a). \qquad (4)$$

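Proof. From Definition 9, we have

$$Pr(q \subseteq_{sim} g) = \sum_{g' \in SIM(q, g)} Pr(g \Rightarrow g') \qquad (5)$$

where $SIM(q, g)$ is the set of possible world graphs that have subgraph
distance to $q$ no larger than $\delta$. Let $d$ be the subgraph distance
between $q$ and $g^c$. We divide $SIM(q, g)$ into $\delta - d + 1$ subsets^5,
$\{SM_0, \ldots, SM_{\delta-d}\}$, such that a possible world graph in $SM_i$ has
subgraph distance $d + i$ with $q$. Thus, from Equation 5, we get

$$
\begin{aligned}
Pr(q \subseteq_{sim} g) &= \sum_{g' \in SM_0 \cup \ldots \cup SM_{\delta-d}} Pr(g \Rightarrow g') \\
&= \sum_{0 \le j_1 \le \delta-d} \; \sum_{g' \in SM_{j_1}} Pr(g \Rightarrow g')
 - \sum_{0 \le j_1 < j_2 \le \delta-d} \; \sum_{g' \in SM_{j_1} \cap SM_{j_2}} Pr(g \Rightarrow g') \\
&\quad + \cdots + (-1)^{i} \sum_{0 \le j_1 < \ldots < j_i \le \delta-d} \; \sum_{g' \in SM_{j_1} \cap \ldots \cap SM_{j_i}} Pr(g \Rightarrow g') \\
&\quad + \cdots + (-1)^{\delta-d} \sum_{g' \in SM_{j_1} \cap \ldots \cap SM_{j_{\delta-d}}} Pr(g \Rightarrow g').
\end{aligned}
\qquad (6)
$$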



4 According to the subgraph similarity search, insertion does not change the
query graph.

5 For $g \in SC_q$, we have $d \le \delta$, since the probabilistic graphs with $d > \delta$
have been filtered out in the deterministic pruning.



Let $L_i$, $0 \le i \le \delta - d$, be the graph set after $q$ is relaxed with
$d + i$ edges, and $BL_i$ be a Boolean variable; when $BL_i$ is true,
it indicates that at least one graph in $L_i$ is a subgraph of $g^c$. Consider
the $i$th item on the RHS of Equation 6; let $A$ be the set composed
of all graphs in the $i$ graph sets, and $B = BL_{j_1} \wedge \ldots \wedge BL_{j_i}$ be the
corresponding Boolean variable of $A$. The set $SM_{j_1} \cap \ldots \cap
SM_{j_i}$ contains all PWGs that have all graphs in $A$. Then, for the
$i$th item, we get

$$
(-1)^{i} \sum_{0 \le j_1 < \ldots < j_i \le \delta-d} \; \sum_{g' \in SM_{j_1} \cap \ldots \cap SM_{j_i}} Pr(g \Rightarrow g')
= (-1)^{i} \sum_{0 \le j_1 < \ldots < j_i \le \delta-d} Pr(BL_{j_1} \wedge \ldots \wedge BL_{j_i}).
\qquad (7)
$$

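Similarly, we can get the results for other items. By replacing
the corresponding items with these results in Equation 6, we get

$$
\begin{aligned}
Pr(q \subseteq_{sim} g) &= \sum_{0 \le j_1 \le \delta-d} Pr(BL_{j_1}) - \sum_{0 \le j_1 < j_2 \le \delta-d} Pr(BL_{j_1} \wedge BL_{j_2}) \\
&\quad + \cdots + (-1)^{i} \sum_{0 \le j_1 < \ldots < j_i \le \delta-d} Pr(BL_{j_1} \wedge \ldots \wedge BL_{j_i}) \\
&\quad + \cdots + (-1)^{\delta-d} Pr(BL_{j_1} \wedge \ldots \wedge BL_{j_{\delta-d}}).
\end{aligned}
\qquad (8)
$$

Based on the Inclusion-Exclusion Principle [26], the RHS of
Equation 8 is $Pr(BL_0 \vee \ldots \vee BL_{\delta-d})$. Clearly, $BL_0 \subseteq \ldots \subseteq
BL_{\delta-d}$, then

$$Pr(BL_0 \vee \ldots \vee BL_{\delta-d}) = Pr(BL_{\delta-d}) = Pr(Brq_1 \vee \ldots \vee Brq_a). \;\blacksquare$$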

Lemma 1 gives a method to compute SSP. Intuitively, the probability
of $q$ being subgraph similar to $g$ equals the probability
that at least one graph of the graph set $U = \{rq_1, \ldots, rq_a\}$ is a subgraph
of $g$, where $U$ is the remaining graph set after $q$ is relaxed with
$\delta$ edges. With Lemma 1, we can formally prove the two pruning
conditions.

Theorem 3. Given a probability threshold $\epsilon$ and $D_g$, if $\sum_{i=1}^{a}
UpperB(f_i^1) < \epsilon$, then $g$ can be safely pruned from $SC_q$.

Proof. Since $rq_i \supseteq_{iso} f_i^1$, we have $Brq_1 \vee \ldots \vee Brq_a \subseteq Bf_1^1 \vee
\ldots \vee Bf_a^1$, where $Bf_i^1$ is a Boolean variable that is true when
$f_i^1$ is a subgraph of $g$, for $1 \le i \le a$. Based on Lemma 1, we
obtain

$$
\begin{aligned}
Pr(q \subseteq_{sim} g) &= Pr(Brq_1 \vee \ldots \vee Brq_a) \\
&\le Pr(Bf_1^1 \vee \ldots \vee Bf_a^1) \\
&\le Pr(Bf_1^1) + \ldots + Pr(Bf_a^1) \\
&\le UpperB(f_1^1) + \ldots + UpperB(f_a^1) < \epsilon.
\end{aligned}
$$

Then $g$ can be pruned. ■

Theorem 4. Given a probability threshold $\epsilon$ and $D_g$, if
$\sum_{i=1}^{a} LowerB(f_i^2) - \sum_{1 \le i, j \le a} UpperB(f_i^2) \, UpperB(f_j^2) \ge \epsilon$, then
$g \in A_q$, where $A_q$ is the final answer set.

Proof. Since $\bigvee_{i=1}^{a} Brq_i \supseteq \bigvee_{i=1}^{a} Bf_i^2$, we can show that

$$
\begin{aligned}
Pr(q \subseteq_{sim} g) &= Pr(Brq_1 \vee \ldots \vee Brq_a) \\
&\ge Pr(Bf_1^2 \vee \ldots \vee Bf_a^2) \\
&\ge \sum_{i=1}^{a} Pr(Bf_i^2) - \sum_{1 \le i, j \le a} Pr(Bf_i^2) \, Pr(Bf_j^2) \\
&\ge \sum_{i=1}^{a} LowerB(f_i^2) - \sum_{1 \le i, j \le a} UpperB(f_i^2) \, UpperB(f_j^2) \\
&\ge \epsilon.
\end{aligned}
$$

Then $g \in A_q$. ■

Figure 5: Obtain tightest $U_{sim}(q)$

Note that the pruning process needs to address the traditional
subgraph isomorphism problem ($rq \subseteq_{iso} f$ or $rq \supseteq_{iso} f$). In our
work, we implement the state-of-the-art method VF2 [10].

3.2 Obtain Tightest Bounds of Subgraph Similarity
Probability

In the pruning conditions, for each $rq_i$ ($1 \le i \le a$), we find only one
pair of features $\{f_i^1, f_i^2\}$, among the $|F|$ features, such that $rq_i \supseteq_{iso} f_i^1$
and $rq_i \subseteq_{iso} f_i^2$. Then we compute the upper bound $U_{sim}(q) =
\sum_{i=1}^{a} UpperB(f_i^1)$ and the lower bound $L_{sim}(q) = \sum_{i=1}^{a} LowerB(f_i^2)
- \sum_{1 \le i, j \le a} UpperB(f_i^2) \, UpperB(f_j^2)$. However, there
are many $f_i^1$s and $f_i^2$s satisfying the conditions among the $|F|$ features;
therefore, we can compute a large number of $U_{sim}(q)$s and $L_{sim}(q)$s.
For each $rq_i$, if we find $x$ features meeting the needs among the $|F|$
features, we can derive $x^a$ $U_{sim}(q)$s. Let $x = 10$ and $a = 10$; then
there are $10^{10}$ upper bounds. The same holds for $L_{sim}(q)$. Clearly,
it is unrealistic to determine the best bounds by enumerating all the
possible ones; thus, in this section, we give efficient algorithms to
obtain the tightest $U_{sim}(q)$ and $L_{sim}(q)$.

3.2.1 Obtain Tightest $U_{sim}(q)$

For each $f_j$ ($1 \le j \le |F|$) in PMI, we determine a graph set
$s_j$ that is a subset of $U = \{rq_1, \ldots, rq_a\}$, such that $rq_i \in s_j$
if $rq_i \supseteq_{iso} f_j$. We also associate $s_j$ with a weight, $UpperB(f_j)$.
Then we obtain $|F|$ sets $\{s_1, \ldots, s_{|F|}\}$ with each set having a weight
$w(s_j) = UpperB(f_j)$ for $1 \le j \le |F|$. With this mapping,
we transform the problem of computing the tightest $U_{sim}(q)$ into a
weighted set cover problem defined as follows.

Definition 10. (Tightest $U_{sim}(q)$) Given a finite set $U = \{rq_1,
\ldots, rq_a\}$ and a collection $S = \{s_1, \ldots, s_j, \ldots, s_{|F|}\}$ of subsets of $U$
with each $s_j$ attached a weight $w(s_j)$, we want to compute a subset
$C \subseteq S$ to minimize $\sum_{s_j \in C} w(s_j)$ s.t. $\bigcup_{s_j \in C} s_j = U$.

Since the set cover problem is well known to be NP-complete [13],
we use a greedy approach to approximate the tightest $U_{sim}(q)$. Algorithm 1
gives the detailed steps. Assume the optimal value is $OPT$;
the approximate value is within $OPT \cdot \ln|U|$ [12].



Algorithm 1 ObtainTightest$U_{sim}(q)$($U$, $S$)

$A \leftarrow \emptyset$, $U_{sim}(q) = 0$;
while $A$ is not a cover of $U$ do
    for each $s \in S$, compute $\gamma(s) = \frac{w(s)}{|s - A|}$;
    choose an $s$ with minimal $\gamma(s)$;
    $A \leftarrow A \cup s$;
    $U_{sim}(q) \mathrel{+}= w(s)$;
end while
return $U_{sim}(q)$;



Example 3. In Figure 1, suppose we use $q$ to query probabilistic
graph 002, and the subgraph distance is 1. The relaxed graph
set of $q$ is $U = \{rq_1, rq_2, rq_3\}$ as shown in Figure 5. Given indexed
features $\{f_1, f_2, f_3\}$, we first determine $s_1 = \{rq_1, rq_2\}$,
$s_2 = \{rq_2, rq_3\}$ and $s_3 = \{rq_1, rq_3\}$. We use $UpperB(f_j)$,
$1 \le j \le 3$, as the weights of the three sets, and thus we have $w(s_1) = 0.4$,
$w(s_2) = 0.1$ and $w(s_3) = 0.5$. Based on Definition 10, we
obtain three $U_{sim}(q)$s, which are 0.4+0.1=0.5, 0.4+0.5=0.9 and
0.1+0.5=0.6. Finally the smallest (tightest) value, 0.5, is used as
the upper bound, i.e., $U_{sim}(q) = 0.5$.
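
The greedy procedure of Algorithm 1 can be sketched in Python and run on the data of Example 3. This is a minimal sketch, not the authors' code: the function name `greedy_upper_bound` and the dictionary inputs are hypothetical, and the greedy ratio is assumed to be the weight of a set divided by the number of elements it newly covers, which is the standard choice for weighted set cover.

```python
def greedy_upper_bound(universe, sets, weights):
    """Greedy weighted set cover in the spirit of Algorithm 1.

    universe: set of relaxed queries; sets[name]: subset covered by a feature;
    weights[name]: UpperB of that feature. Returns (bound, chosen names).
    """
    covered, bound, chosen = set(), 0.0, []
    while not universe <= covered:
        # Cost per newly covered element; sets that add nothing are skipped.
        ratios = {
            name: weights[name] / len((s & universe) - covered)
            for name, s in sets.items()
            if (s & universe) - covered
        }
        if not ratios:
            raise ValueError("the given sets do not cover the universe")
        best = min(ratios, key=ratios.get)
        covered |= sets[best]
        bound += weights[best]
        chosen.append(best)
    return bound, chosen


# Data of Example 3.
U = {"rq1", "rq2", "rq3"}
S = {"s1": {"rq1", "rq2"}, "s2": {"rq2", "rq3"}, "s3": {"rq1", "rq3"}}
W = {"s1": 0.4, "s2": 0.1, "s3": 0.5}
bound, chosen = greedy_upper_bound(U, S, W)
print(bound, chosen)
```

On this instance the greedy choice picks $s_2$ first and then $s_1$, which coincides with the tightest bound found by enumeration in Example 3. In general the greedy result is only an approximation.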

3.2.2 Obtain Tightest $L_{sim}(q)$

For the lower bound $L_{sim}(q)$, the larger (tighter) $L_{sim}(q)$ is, the
better the probabilistic pruning power is. Here we formalize the
problem of computing the largest $L_{sim}(q)$ as an integer quadratic programming
problem, and develop an efficient randomized algorithm
to solve it.

For each $f_i$ ($1 \le i \le |F|$) in PMI, we determine a graph set
$s_i$ that is a subset of $U = \{rq_1, \ldots, rq_a\}$, such that $rq_j \in s_i$ if
$rq_j \subseteq_{iso} f_i$. We associate $s_i$ with a pair weight $\{LowerB(f_i), UpperB(f_i)\}$.
Then we obtain $|F|$ sets $\{s_1, \ldots, s_{|F|}\}$ with each set having
a pair weight $\{w_L(s_i), w_U(s_i)\}$ for $1 \le i \le |F|$. Thus the problem
of computing the tightest $L_{sim}(q)$ can be formalized as follows.

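Definition 11. (Tightest $L_{sim}(q)$) Given a finite set $U = \{rq_1, \ldots, rq_a\}$
and a collection $S = \{s_1, \ldots, s_{|F|}\}$ of subsets of $U$, with
each $s_i$ attached a pair weight $\{w_L(s_i), w_U(s_i)\}$, we want to compute
a subset $C \subseteq \{s_1, \ldots, s_{|F|}\}$ that maximizes

$$\sum_{s_i \in C} w_L(s_i) - \sum_{s_i, s_j \in C,\ i < j} w_U(s_i)\, w_U(s_j).$$

Associate an indicator variable, $x_{s_i}$, with each set $s_i \in S$, which
takes value 1 if set $s_i$ is selected and 0 otherwise. Then we want to:

$$\begin{aligned}
\text{Maximize} \quad & \sum_{s_i \in S} x_{s_i} w_L(s_i) - \sum_{s_i, s_j \in S,\ i < j} x_{s_i} x_{s_j} w_U(s_i)\, w_U(s_j) \\
\text{s.t.} \quad & \sum_{s:\ rq \in s} x_s \ge 1 \quad \forall rq \in U, \\
& x_s \in \{0, 1\}.
\end{aligned} \tag{9}$$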

Equation 9 is an integer quadratic program, which is a hard
problem [13]. We relax $x_{s_i}$ to take values within $[0, 1]$, i.e., $x_{s_i} \in [0, 1]$.
Then the problem becomes a standard quadratic program (QP).
Clearly, this QP is convex, and there is an efficient algorithm
to solve it [23]. Since all feasible solutions of
Equation 9 are also feasible solutions of the relaxed QP,
the maximum value $QP(I)$ computed by the relaxed
QP provides an upper bound on the value computed in Equation 9.
Thus the value of $QP(I)$ can be used as the tightest lower bound.
However, the proposed relaxation technique cannot give any theoretical
guarantee on how close $QP(I)$ is to the optimum of Equation 9 [12].

Now, following the relaxed QP, we propose a randomized rounding
algorithm that yields an approximation bound for Equation 9.
Algorithm 2 shows the detailed steps. According to Equation 9,
it is not difficult to see that the more elements in $U$ are covered, the
tighter $L_{sim}(q)$ is. The following theorem states that the number
of covered elements of $U$ has a theoretical guarantee.

Theorem 5. When Algorithm 2 terminates, the probability that
all elements are covered is at least $1 - \frac{1}{|U|}$.

Algorithm 2 ObtainTightest$L_{sim}(q)$($U$, $S$)

1: $C \leftarrow \emptyset$, $L_{sim}(q) = 0$;
2: Let $x^*$ be an optimal solution to the relaxed QP;
3: for $k = 1$ to $2\ln|U|$ do
4:    Pick each $s \in S$ independently with probability $x^*_s$;
5:    if $s$ is picked then
6:       $C \leftarrow C \cup \{s\}$;
7:       $L_{sim}(q) = L_{sim}(q) + w_L(s) - w_U(s) \sum_{i=1}^{|C|} w_U(s_i)$;
8:    end if
9: end for
10: return $L_{sim}(q)$;



Figure 6: Obtain tightest $L_{sim}(q)$



Proof. For an element $rq \in U$, the probability that $rq$ is not
covered in an iteration is

$$\prod_{s:\ rq \in s} (1 - x^*_s) \le e^{-\sum_{s:\ rq \in s} x^*_s} \le \frac{1}{e}.$$

Then the probability that $rq$ is not covered at the end of the algorithm is at most
$e^{-2\ln|U|} = \frac{1}{|U|^2}$. Thus, the probability that there is some $rq$ that
is not covered is at most $|U| \cdot 1/|U|^2 = 1/|U|$. ■

Example 4. In Figure 1, suppose we use $q$ to query probabilistic
graph 002, and the subgraph distance is 1. The relaxed graph
set of $q$ is $U = \{rq_1, rq_2, rq_3\}$, shown in Figure 6. Given indexed
features $\{f_1, f_2\}$, we first determine $s_1 = \{rq_1\}$ and
$s_2 = \{rq_1, rq_2, rq_3\}$. Then we use $\{LowerB(f_i), UpperB(f_i)\}$,
$1 \le i \le 2$, as weights, and thus we have $\{w_L(s_1) = 0.28, w_U(s_1) = 0.36\}$
and $\{w_L(s_2) = 0.08, w_U(s_2) = 0.15\}$. Based on Definition 11,
selecting both sets gives $0.28 + 0.08 - 0.36 \times 0.15 = 0.306$, so we assign $L_{sim}(q) = 0.31$.
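
The rounding step of Algorithm 2 and the objective of Definition 11 can be illustrated with the following Python sketch. Several things here are assumptions rather than part of the paper: the fractional solution `x_star` is taken as given (no QP solver is invoked), the pairwise penalty is summed over unordered pairs of selected sets, and the objective is evaluated once after rounding instead of incrementally. The weights are those of Example 4.

```python
import math
import random
from itertools import combinations


def objective(selected, w_low, w_up):
    """Sum of lower weights minus the products of upper weights over unordered pairs."""
    gain = sum(w_low[s] for s in selected)
    penalty = sum(w_up[a] * w_up[b] for a, b in combinations(sorted(selected), 2))
    return gain - penalty


def randomized_rounding(universe, sets, x_star, seed=0):
    """Pick each set with probability x_star[set], repeated about 2 ln|U| times."""
    rng = random.Random(seed)
    rounds = max(1, math.ceil(2 * math.log(len(universe))))
    chosen = set()
    for _ in range(rounds):
        for name in sets:
            if rng.random() < x_star[name]:
                chosen.add(name)
    covered = set().union(*(sets[name] for name in chosen))
    return chosen, universe <= covered


# Sets and weights from Example 4; x_star is a hypothetical relaxed solution.
U = {"rq1", "rq2", "rq3"}
S = {"s1": {"rq1"}, "s2": {"rq1", "rq2", "rq3"}}
w_low = {"s1": 0.28, "s2": 0.08}
w_up = {"s1": 0.36, "s2": 0.15}
x_star = {"s1": 1.0, "s2": 1.0}

chosen, is_cover = randomized_rounding(U, S, x_star)
l_sim = objective(chosen, w_low, w_up)
print(sorted(chosen), is_cover, round(l_sim, 3))
```

With both sets selected, the objective evaluates to 0.306, which rounds to the value 0.31 used in Example 4.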

4. PROBABILISTIC MATRIX INDEX 

In this section, we discuss how to obtain tight $\{LowerB(f), UpperB(f)\}$
and how to generate the features used in the probabilistic matrix index
(PMI).

4.1 Bounds of Subgraph Isomorphism Probability

4.1.1 LowerB(f) 

Let $Ef = \{f_1, \ldots, f_{|Ef|}\}$ be the set of all embeddings$^6$ of feature
$f$ in the deterministic graph $g^c$, $Bf_i$ be a Boolean variable for
$1 \le i \le |Ef|$, which indicates whether $f_i$ exists in $g^c$ or not,
and $\Pr(Bf_i)$ be the probability that the embedding $f_i$ exists in $g$.
Similar to Lemma 1, we have

$$\Pr(f \subseteq_{iso} g) = \Pr(Bf_1 \vee \ldots \vee Bf_{|Ef|}). \tag{10}$$



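According to Theorem 2, it is not difficult to see that calculating
the exact $\Pr(f \subseteq_{iso} g)$ is NP-complete. Thus we rewrite Equation
10 as follows:

$$\begin{aligned}
\Pr(f \subseteq_{iso} g) &= \Pr(Bf_1 \vee \ldots \vee Bf_{|Ef|}) \\
&= 1 - \Pr(\overline{Bf_1} \wedge \ldots \wedge \overline{Bf_{|Ef|}}) \\
&\ge 1 - \Pr(\overline{Bf_1} \wedge \ldots \wedge \overline{Bf_{|IN|}} \mid \overline{Bf_{|IN|+1}} \wedge \ldots \wedge \overline{Bf_{|Ef|}}),
\end{aligned} \tag{11}$$

where $IN = \{Bf_1, \ldots, Bf_{|IN|}\} \subseteq Ef$.

$^6$ In this paper, we use the algorithm in [36] to compute embeddings of a
feature in $g^c$.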

Suppose the embeddings corresponding to $Bf_i$, $1 \le i \le |IN|$, do
not have common parts (edges). Since $g^c$ is connected, these $|IN|$
Boolean variables are conditionally independent given any random
variable of $g$. Then Equation 11 is written as



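$$\begin{aligned}
\Pr(f \subseteq_{iso} g) &\ge 1 - \Pr(\overline{Bf_1} \wedge \ldots \wedge \overline{Bf_{|IN|}} \mid \overline{Bf_{|IN|+1}} \wedge \ldots \wedge \overline{Bf_{|Ef|}}) \\
&= 1 - \prod_{i=1}^{|IN|} \left[1 - \Pr(Bf_i \mid \overline{Bf_{|IN|+1}} \wedge \ldots \wedge \overline{Bf_{|Ef|}})\right].
\end{aligned} \tag{12}$$

For variables $Bf_x, Bf_y \in \{Bf_{|IN|+1}, \ldots, Bf_{|Ef|}\}$, we have

$$\begin{aligned}
\Pr(Bf_i \mid \overline{Bf_x} \wedge \overline{Bf_y})
&= \frac{\Pr(Bf_i \wedge \overline{Bf_x} \wedge \overline{Bf_y})}{\Pr(\overline{Bf_x} \wedge \overline{Bf_y})} \\
&= \frac{\Pr(Bf_i \wedge \overline{Bf_x} \wedge \overline{Bf_y}) / \Pr(\overline{Bf_y})}{\Pr(\overline{Bf_x} \wedge \overline{Bf_y}) / \Pr(\overline{Bf_y})} \\
&= \frac{\Pr(Bf_i \wedge \overline{Bf_x} \mid \overline{Bf_y})}{\Pr(\overline{Bf_x} \mid \overline{Bf_y})}.
\end{aligned} \tag{13}$$

If $Bf_i$ and $\overline{Bf_x}$ are conditionally independent given $\overline{Bf_y}$, then

$$\Pr(Bf_i \wedge \overline{Bf_x} \mid \overline{Bf_y}) = \Pr(Bf_i \mid \overline{Bf_y}) \Pr(\overline{Bf_x} \mid \overline{Bf_y}). \tag{14}$$

By combining Equations 13 and 14, we obtain

$$\Pr(Bf_i \mid \overline{Bf_x} \wedge \overline{Bf_y}) = \Pr(Bf_i \mid \overline{Bf_y}). \tag{15}$$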

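Based on this property, Equation 12 is reduced to

$$\begin{aligned}
\Pr(f \subseteq_{iso} g) &\ge 1 - \prod_{i=1}^{|IN|} \left[1 - \Pr(Bf_i \mid \overline{Bf_{|IN|+1}} \wedge \ldots \wedge \overline{Bf_{|Ef|}})\right] \\
&= 1 - \prod_{i=1}^{|IN|} \left[1 - \Pr(Bf_i \mid \overline{Bf_1} \wedge \ldots \wedge \overline{Bf_{|C|}})\right] \\
&= 1 - \prod_{i=1}^{|IN|} \left[1 - \Pr(Bf_i \mid COR)\right],
\end{aligned} \tag{16}$$

where $COR = \overline{Bf_1} \wedge \ldots \wedge \overline{Bf_{|C|}}$, and the corresponding embedding of $Bf_j \in C = \{Bf_1, \ldots, Bf_{|C|}\}$ overlaps with the corresponding embedding of $Bf_i$.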

For a given $Bf_i$, $\Pr(Bf_i \mid COR)$ is a constant, since the number
of embeddings overlapping with $f_i$ in $g^c$ is constant. Now we
obtain the lower bound of $\Pr(f \subseteq_{iso} g)$ as

$$LowerB(f) = 1 - \prod_{i=1}^{|IN|} \left[1 - \Pr(Bf_i \mid COR)\right], \tag{17}$$

which depends only on the selected $|IN|$ embeddings that do
not have common parts with each other.

To compute $\Pr(Bf_i \mid COR)$, a straightforward approach is the
following. We first join all the joint probability tables (JPTs), and
meanwhile multiply the joint probabilities of the joining tuples in the JPTs.
Then, in the join result, we project on the edge labels involved in $Bf_i$
and $COR$, and eliminate duplicates by summing up their existence
probabilities. The resulting sum is the final result. However, this
solution is clearly inefficient because of the cost of the join, the duplicate
elimination, and the probability multiplication.

Figure 7: Embeddings and $fG$ of feature $f_2$ in probabilistic graph 002

In order to calculate $\Pr(Bf_i \mid COR)$ efficiently, we use a sampling
algorithm to estimate its value. Algorithm 3 shows the detailed
steps. The main idea of the algorithm is as follows. We first
sample a possible world $g'$. Then we check the condition in Line
4, which is used to estimate $\Pr(Bf_i \wedge COR)$, and the condition
in Line 7, which is used to estimate $\Pr(COR)$. Finally we return
$n_1/n_2$, which is an estimate of $\Pr(Bf_i \wedge COR)/\Pr(COR) = \Pr(Bf_i \mid COR)$.
The number of iterations $m$ is set to $(4\ln\frac{2}{\xi})/\tau^2$
($0 < \xi < 1$, $\tau > 0$), as used in Monte Carlo theory [26].

Algorithm 3 Calculate$\Pr(Bf_i \mid COR)$($g$, $Bf_i$, $COR$)

1: $n_1 = 0$, $n_2 = 0$;
2: for $i = 1$ to $m$ do
3:    Sample each neighbor edge set $ne$ of $g$ according to $\Pr(x_{ne})$, and then obtain an instance $g'$;
4:    if $g'$ has embedding $f_i$ & no embeddings involved in $COR$ then
5:       $n_1$ += 1;
6:    end if
7:    if $g'$ has no embeddings involved in $COR$ then
8:       $n_2$ += 1;
9:    end if
10: end for
11: return $n_1/n_2$;
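
The sampling estimator of Algorithm 3 can be sketched as follows. The data layout is an assumption made for illustration only: each neighbor edge set is a pair of an edge tuple and a joint probability table keyed by 0/1 assignments, the neighbor edge sets are assumed not to share edges, and embeddings are represented as frozensets of edge labels. The function names are hypothetical.

```python
import random


def sample_world(neighbor_sets, rng):
    """Draw one possible world and return the set of edges that are present."""
    present = set()
    for edges, jpt in neighbor_sets:
        assignments = list(jpt)
        picked = rng.choices(assignments, weights=[jpt[a] for a in assignments])[0]
        present.update(e for e, bit in zip(edges, picked) if bit)
    return present


def estimate_conditional(neighbor_sets, target, overlapping, m, seed=0):
    """Estimate Pr(target embedding exists | no overlapping embedding exists)."""
    rng = random.Random(seed)
    n1 = n2 = 0
    for _ in range(m):
        world = sample_world(neighbor_sets, rng)
        cor_holds = not any(emb <= world for emb in overlapping)
        if cor_holds:
            n2 += 1                      # condition of Line 7
            if target <= world:
                n1 += 1                  # condition of Line 4
    return n1 / n2 if n2 else None       # undefined if COR was never observed


# Hypothetical neighbor edge set with correlated edges e1 and e2.
ne = [(("e1", "e2"), {(1, 1): 0.3, (1, 0): 0.3, (0, 1): 0.2, (0, 0): 0.2})]
estimate = estimate_conditional(ne, frozenset({"e1"}), [frozenset({"e1", "e2"})], m=20000)
print(estimate)
```

For this toy table the exact value is $0.3 / 0.7$, so the printed estimate should be close to 0.43; the sketch returns `None` when no sampled world satisfies $COR$.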



Example 5. In Figure 4, consider $f_2$, a feature of probabilistic
graph 002 shown in Figure 1. $f_2$ has three embeddings in
002, namely $EM1$, $EM2$ and $EM3$, as shown in Figure 7. Among the
corresponding variables $Bf_i$, $Bf_1$ and $Bf_3$ are conditionally independent
given $Bf_2$. Then based on Equation 17, we have $LowerB(f) =
1 - [1 - \Pr(Bf_1 \mid \overline{Bf_2})][1 - \Pr(Bf_3 \mid \overline{Bf_2})] = 0.26$.

As stated earlier, $LowerB(f)$ depends on embeddings that do not
have common parts. However, among all $|Ef|$ embeddings, there
are many groups of disjoint embeddings, and they lead to
different lower bounds. We want a tight lower bound in order
to increase the pruning power. Next, we introduce how to obtain
the tightest $LowerB(f)$.

Obtain Tightest Lower Bound. We construct an undirected graph,
$fG$, with each node representing an embedding $f_i$, $1 \le i \le |Ef|$,
and a link connecting two disjoint embeddings (nodes). Note that,
to avoid confusion, nodes and links are used for $fG$, while vertices
and edges are used for graphs. We also assign each node a weight,
$-\ln[1 - \Pr(Bf_i \mid COR)]$. In $fG$, a clique is a set of nodes such
that any two nodes of the set are adjacent. We define the weight of
a clique as the sum of the node weights in the clique. Clearly, given
a clique in $fG$ with weight $v$, $LowerB(f)$ is $1 - e^{-v}$. Thus, the
larger the weight, the tighter (larger) the lower bound. To obtain a
tight lower bound, we should find a clique whose weight is largest,
which is exactly the maximum weight clique problem. Here we use
the efficient solution in [7] to solve the maximum weight clique problem,
and the algorithm returns the largest weight $z$. Therefore, we use
$1 - e^{-z}$ as the tightest value for $LowerB(f)$.
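
The reduction to a maximum weight clique can be illustrated with a small brute-force search. This sketch is not the solver of [7]; it enumerates all node subsets, which is only feasible for a handful of embeddings. The conditional probabilities below are hypothetical values chosen for illustration (all strictly below 1), and the function name is an assumption.

```python
import math
from itertools import combinations


def tightest_lower_bound(cond_prob, disjoint_pairs):
    """Return (best clique, 1 - exp(-weight)) by exhaustive search over cliques of fG."""
    nodes = sorted(cond_prob)
    linked = {frozenset(pair) for pair in disjoint_pairs}
    weight = {n: -math.log(1.0 - cond_prob[n]) for n in nodes}
    best_clique, best_weight = (), 0.0
    for size in range(1, len(nodes) + 1):
        for group in combinations(nodes, size):
            # A clique requires every pair of embeddings to be disjoint (linked).
            if all(frozenset(pair) in linked for pair in combinations(group, 2)):
                total = sum(weight[n] for n in group)
                if total > best_weight:
                    best_clique, best_weight = group, total
    return set(best_clique), 1.0 - math.exp(-best_weight)


# Hypothetical Pr(Bf_i | COR) values; only EM1 and EM3 are disjoint.
probs = {"EM1": 0.14, "EM2": 0.11, "EM3": 0.14}
clique, lower_b = tightest_lower_bound(probs, [("EM1", "EM3")])
print(sorted(clique), round(lower_b, 4))
```

Because the node weights are $-\ln[1 - \Pr(Bf_i \mid COR)]$, summing them over a clique and applying $1 - e^{-v}$ reproduces the product form of Equation 17.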



Example 6. Following Example 5, as shown in Figure 7, $EM1$
is disjoint from $EM3$. Based on the above discussion, we construct
$fG$ for the three embeddings, shown in Figure 7. There
are two maximal cliques, namely $\{EM1, EM3\}$ and $\{EM2\}$. According
to Equation 17, the lower bounds derived from the two
cliques are 0.26 and 0.11 respectively. Therefore we select
the larger (tighter) value, 0.26, as the lower bound of $f_2$ in 002.
4.1.2 UpperB(f) 

First, we define the embedding cut: for a feature $f$, an embedding
cut is a set of edges in $g^c$ whose removal causes the absence of
all of $f$'s embeddings in $g^c$. An embedding cut is minimal if no proper
subset of the embedding cut is an embedding cut. In this paper, we
use minimal embedding cuts.

Denote an embedding cut by $c$ and its corresponding Boolean
variable (defined in the same way as $Bf$) by $Bc$, where $Bc$ being true indicates that the
embedding cut $c$ exists in $g^c$. Similar to Equation 10, it is not difficult
to obtain



$$\begin{aligned}
\Pr(f \subseteq_{iso} g) &= 1 - \Pr(Bc_1 \vee \ldots \vee Bc_{|Ec|}) \\
&= \Pr(\overline{Bc_1} \wedge \ldots \wedge \overline{Bc_{|Ec|}}),
\end{aligned} \tag{18}$$

where $Ec = \{c_1, \ldots, c_{|Ec|}\}$ is the set of all embedding cuts of
$f$ in $g^c$. Equation 18 shows that the subgraph isomorphism probability
of $f$ to $g$ equals the probability of all of $f$'s embedding cuts
disappearing in $g$.

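Similar to the deduction from Equation 10 to Equation 17 for $LowerB(f)$,
we can rewrite Equation 18 as follows:

$$\begin{aligned}
\Pr(f \subseteq_{iso} g) &= \Pr(\overline{Bc_1} \wedge \ldots \wedge \overline{Bc_{|Ec|}}) \\
&\le \Pr(\overline{Bc_1} \wedge \ldots \wedge \overline{Bc_{|IN'|}} \mid \overline{Bc_{|IN'|+1}} \wedge \ldots \wedge \overline{Bc_{|Ec|}}) \\
&= \prod_{i=1}^{|IN'|} \left[1 - \Pr(Bc_i \mid \overline{Bc_{|IN'|+1}} \wedge \ldots \wedge \overline{Bc_{|Ec|}})\right] \\
&= \prod_{i=1}^{|IN'|} \left[1 - \Pr(Bc_i \mid \overline{Bc_1} \wedge \ldots \wedge \overline{Bc_{|D|}})\right] \\
&= \prod_{i=1}^{|IN'|} \left[1 - \Pr(Bc_i \mid COM)\right],
\end{aligned} \tag{19}$$

where $IN' = \{Bc_1, \ldots, Bc_{|IN'|}\}$ is a set of Boolean variables whose corresponding
cuts are disjoint, $COM = \overline{Bc_1} \wedge \ldots \wedge \overline{Bc_{|D|}}$, and the corresponding cut of
$Bc_j \in D = \{Bc_1, \ldots, Bc_{|D|}\}$ has common parts with the corresponding cut of
$Bc_i$.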

Finally we obtain the upper bound as

$$UpperB(f) = \prod_{i=1}^{|IN'|} \left[1 - \Pr(Bc_i \mid COM)\right]. \tag{20}$$

The upper bound relies only on the picked embedding cut set, in
which any two cuts are disjoint.

The value of $\Pr(Bc_i \mid COM)$ is estimated using Algorithm 3 by
replacing embeddings with cuts. Similar to the lower bound, computing
the tightest $UpperB(f)$ can be converted into a maximum weight
clique problem. However, different from the lower bound, each node
of the constructed graph $fG$ represents a cut and has a weight of
$-\ln[1 - \Pr(Bc_i \mid COM)]$ instead. Thus, for the maximum weight
clique with weight $v$, the tightest value of $UpperB(f)$ is $e^{-v}$.

Now we discuss how to determine embedding cuts in $g^c$.

Calculation of Embedding Cuts

We build a connection between embedding cuts in $g^c$ and cuts
for two vertices in a deterministic graph.

Figure 8: Transformation from embeddings of $f_2$ to parallel graph $cG$

Suppose $f$ has $|Ef|$ embeddings in $g^c$, and each embedding has
$k$ edges. Assign $k$ labels, $\{e_1, \ldots, e_k\}$, to the edges of each embedding
(in arbitrary order). We create a corresponding line graph
for each embedding by (1) creating $k + 1$ isolated nodes, and (2)
connecting these $k + 1$ nodes into a line using the $k$ edges
(with corresponding labels) of the embedding. Based on these line
graphs, we construct a parallel graph, $cG$. The node set of $cG$ consists
of all nodes of the $|Ef|$ line graphs and two new nodes, $s$ and
$t$. The edge set of $cG$ consists of all edges (with labels) of the $|Ef|$
line graphs. In addition, one edge (without label) is placed between
an end node of each line graph and $s$. Similarly, there is an edge
between $t$ and the other end node of each line graph. As a result, the
$|Ef|$ embeddings are transformed into a deterministic graph $cG$.

Based on this transformation, we have 

Theorem 6. The embedding cut set of $g^c$ is also the cut set
(without edges incident to $s$ and $t$) from $s$ to $t$ in $cG$.

In this work, we determine embedding cuts using the method in 
[22]. 
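
Since every $s$-$t$ path in $cG$ corresponds to one embedding, a set of labeled edges separates $s$ from $t$ exactly when it contains at least one edge of every embedding. The following Python sketch uses this observation to enumerate minimal embedding cuts by brute force. It is not the method of [22]; the embeddings and edge labels are hypothetical, and the function name is an assumption.

```python
from itertools import combinations


def minimal_embedding_cuts(embeddings):
    """Enumerate minimal edge sets that intersect every embedding (brute force)."""
    edges = sorted(set().union(*embeddings))
    cuts = []
    # Candidates are visited by increasing size, so any candidate containing
    # an already found cut is not minimal and can be skipped.
    for size in range(1, len(edges) + 1):
        for candidate in combinations(edges, size):
            cand = frozenset(candidate)
            if any(found <= cand for found in cuts):
                continue
            if all(cand & emb for emb in embeddings):
                cuts.append(cand)
    return cuts


# Hypothetical embeddings: EM1 and EM3 are disjoint, EM2 overlaps both.
embeddings = [
    frozenset({"e1", "e2"}),
    frozenset({"e2", "e3"}),
    frozenset({"e3", "e4"}),
]
cuts = sorted(sorted(c) for c in minimal_embedding_cuts(embeddings))
print(cuts)
```

The exhaustive search is exponential in the number of distinct edges, so it is suitable only for illustrating the definition on small features.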

Example 7. Figure 8 shows the transformation for feature fa 
in graph 002 in Figure 1. In cG, we can find cuts {e2 , ei\, {&x , e-j , 
e^} and \ei , 63} which are clearly the embedding cuts of f^ In 002. 

4.2 Feature Generation 

We would like to select frequent and discriminative features to 
construct probabilistic matrix index (PMI). 

To achieve this, we consider $UpperB(f)$ given in Equation 20,
since the upper bound plays the most important role in the pruning
capability. According to Equation 20, to get a tight upper bound,
we need a large disjoint cut set and a large $\Pr(Bc_i \mid COM)$. Suppose
the cut set is $IN''$. Note that $|IN''| = |IN'|$, since each cut
in $IN''$ has a corresponding Boolean variable $Bc_i$ in $IN'$. From
the calculation of embedding cuts, it is not difficult to see that a
large number of disjoint embeddings leads to a large $|IN''|$. Thus
we would like a feature that has a large number of disjoint embeddings.
Since $|COM|$ is small, a small feature results in a large
$\Pr(Bc_i \mid COM)$. In summary, we should index a feature that
complies with the following rules:

Rule 1. Select features that have a large number of disjoint em- 
beddings. 

Rule 2. Select small size features. 

To achieve Rule 1, we define the frequency of feature $f$ as
$frq(f) = \frac{|\{g \mid f \subseteq_{iso} g^c,\ |IN|/|Ef| \ge \alpha,\ g \in D\}|}{|D|}$, where $\alpha$ is a threshold on the
ratio of disjoint embeddings among all embeddings. Given a frequency
threshold $\beta$, a feature $f$ is frequent iff $frq(f) \ge \beta$. Thus
we would like to index a frequent feature. To achieve Rule 2, we
control the feature size used in Algorithm 4. To control the number of
features [37, 29], we also define the discriminative measure as
$dis(f) = \frac{|\bigcap \{D_{f'} \mid f' \subseteq_{iso} f\}|}{|D_f|}$, where $D_f$ is the list of probabilistic graphs $g$ s.t.
$f \subseteq_{iso} g^c$. Given a discriminative threshold $\gamma$, a feature $f$ is
discriminative iff $dis(f) \ge \gamma$. Thus we should also select a discriminative
feature.



Based on the above discussion, we select frequent and discriminative
features, which is implemented in Algorithm 4. In this algorithm,
we first initialize a feature set $F$ with single edges or vertices
(lines 1-4). Then we increase the feature size (number of vertices) from
1, and pick out desirable features (lines 6-9). $maxL$ is used to control
the feature size, and guarantees picking out small features
satisfying Rule 2. $frq(f)$ and $dis(f)$ are used to measure the frequency
and discrimination of a feature. The controlling parameters
$\alpha$, $\beta$ and $\gamma$ guarantee picking out features satisfying Rule 1. The
default values of the parameters are usually set to 0.1 [37, 38].



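Algorithm 4 FeatureSelection($D$, $\alpha$, $\beta$, $\gamma$, $maxL$)

1: $F \leftarrow \emptyset$;
2: Initial a feature set $F$ with single edge or vertex;
3: $D_f \leftarrow \{g \mid f \subseteq_{iso} g^c\}$;
4: $F \leftarrow F \cup \{f\}$;
5: for $i = 1$ to $maxL$ do
6:    for each feature $f$ with $i$ vertices do
7:       if $frq(f) \ge \beta$ & $dis(f) \ge \gamma$ then
8:          $D_f \leftarrow \{g \mid f \subseteq_{iso} g^c\}$;
9:          $F \leftarrow F \cup \{f\}$;
10:       end if
11:    end for
12: end for
13: return $F$;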
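
The filtering loop of Algorithm 4 can be sketched as below. The sketch makes several assumptions that are not in the paper: candidate features are given up front with precomputed per-graph embedding counts and a precomputed $dis(f)$ value, the frequency is read as the fraction of database graphs whose ratio of disjoint embeddings reaches $\alpha$, and all names (`frq`, `select_features`, the dictionary keys) are hypothetical.

```python
def frq(embedding_stats, alpha, db_size):
    """Fraction of database graphs whose disjoint/total embedding ratio is >= alpha.

    embedding_stats maps a graph id to (n_disjoint, n_total) for graphs containing f.
    """
    support = sum(
        1 for n_disjoint, n_total in embedding_stats.values()
        if n_total > 0 and n_disjoint / n_total >= alpha
    )
    return support / db_size


def select_features(candidates, db_size, alpha, beta, gamma, max_l):
    """Keep candidates that are small (size <= max_l), frequent, and discriminative."""
    selected = []
    for size in range(1, max_l + 1):
        for cand in candidates:
            if cand["size"] != size:
                continue
            if frq(cand["stats"], alpha, db_size) >= beta and cand["dis"] >= gamma:
                selected.append(cand["name"])
    return selected


# Hypothetical candidates over a database of 4 graphs.
candidates = [
    {"name": "f1", "size": 1, "stats": {"g1": (2, 4), "g2": (1, 10)}, "dis": 0.5},
    {"name": "f2", "size": 5, "stats": {"g1": (3, 3)}, "dis": 0.9},
    {"name": "f3", "size": 2, "stats": {"g1": (2, 2), "g3": (1, 1)}, "dis": 0.05},
]
features = select_features(candidates, db_size=4, alpha=0.2, beta=0.1, gamma=0.1, max_l=3)
print(features)
```

In this toy run, `f2` is dropped by the size limit and `f3` by the discriminative threshold, which mirrors how Rules 1 and 2 interact with $\gamma$.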



5. VERIFICATION 

In this section, we present the algorithms to compute the subgraph
similarity probability (SSP) of a candidate probabilistic graph $g$ to $q$.

Equation 4 is the formula to compute SSP. By simplifying this
equation, we have

$$\Pr(q \subseteq_{sim} g) = \sum_{i=1}^{a} (-1)^{i+1} \sum_{J \subseteq \{1, \ldots, a\},\ |J| = i} \Pr\Big(\bigwedge_{j \in J} Brq_j\Big). \tag{21}$$

Clearly, we need an exponential number of steps to perform the exact
calculation. Therefore, we develop an efficient sampling algorithm
to estimate $\Pr(q \subseteq_{sim} g)$.

By Equation 4, we know there are in total $a$ variables $Brq$ that are used to
compute SSP. By Equation 10, we know $Brq = Bf_1 \vee \ldots \vee Bf_{|Ef|}$.
Then, we have

$$\Pr(q \subseteq_{sim} g) = \Pr(Bf_1 \vee \ldots \vee Bf_m), \tag{22}$$

where $m$ is the number of $Bf$s contained in these $a$ $Brq$s.

Assume the $m$ $Bf$s involve Boolean variables $x_1, \ldots, x_k$ for uncertain
edges. Algorithm 5 gives the detailed steps of the sampling algorithm.
In this algorithm, we use the junction tree algorithm to calculate
$\Pr(Bf_i)$ [17].



Algorithm 5 Calculate $\Pr(q \subseteq_{sim} g)$

1: $Cnt = 0$, $V = \sum_{i=1}^{m} \Pr(Bf_i)$;
2: $N = (4\ln\frac{2}{\xi})/\tau^2$;
3: for 1 to $N$ do
4:    randomly choose $i \in \{1, \ldots, m\}$ with probability $\Pr(Bf_i)/V$;
5:    randomly choose $x_1, \ldots, x_k$ (according to probability $\Pr(x_{ne})$) with values in $\{0, 1\}$ s.t. $Bf_i = 1$;
6:    if $Bf_1 = 0 \wedge \ldots \wedge Bf_{i-1} = 0$ then
7:       $Cnt = Cnt + 1$;
8:    end if
9: end for
10: return $V \cdot Cnt/N$;
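
The sampling scheme of Algorithm 5 follows the pattern of the classical Karp-Luby estimator for the probability of a union of events, which scales the observed fraction by $V$. The Python sketch below is a simplification under a loudly stated assumption: edges are treated as independent, so $\Pr(Bf_i)$ is a plain product and conditioning on $Bf_i = 1$ only fixes the edges of that embedding. The paper instead uses correlated neighbor edge sets and the junction tree algorithm. All names are hypothetical.

```python
import math
import random


def karp_luby_union(embeddings, edge_prob, xi=0.05, tau=0.05, seed=0):
    """Estimate Pr(at least one embedding exists), assuming independent edges."""
    rng = random.Random(seed)
    probs = [math.prod(edge_prob[e] for e in emb) for emb in embeddings]
    total = sum(probs)                                   # V in Algorithm 5
    if total == 0:
        return 0.0
    n = math.ceil(4 * math.log(2 / xi) / tau ** 2)       # N in Algorithm 5
    edges = sorted(edge_prob)
    cnt = 0
    for _ in range(n):
        # Choose an embedding with probability Pr(Bf_i) / V.
        i = rng.choices(range(len(embeddings)), weights=probs)[0]
        # Sample a world in which embedding i is present.
        world = {e for e in edges if e in embeddings[i] or rng.random() < edge_prob[e]}
        # Count the sample only if no earlier embedding is also present.
        if not any(embeddings[j] <= world for j in range(i)):
            cnt += 1
    return total * cnt / n


# Hypothetical instance: exact value is 0.25 + 0.25 - 0.125 = 0.375.
embs = [frozenset({"a", "b"}), frozenset({"b", "c"})]
p = {"a": 0.5, "b": 0.5, "c": 0.5}
estimate = karp_luby_union(embs, p)
print(estimate)
```

Counting a sample only when embedding $i$ is the first satisfied one avoids double counting worlds that contain several embeddings.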



6. PERFORMANCE EVALUATION 

In this section, we report the effectiveness and efficiency test
results of our newly proposed techniques. Our methods are implemented
on a Windows XP machine with a Core 2 Duo CPU (2.8
GHz and 2.8 GHz) and 4GB main memory. Programs are compiled
by Microsoft Visual C++ 6.0. In the experiments, we use a
real probabilistic graph dataset.

Real Probabilistic Graph Dataset. The real probabilistic graph
dataset is obtained from the STRING database$^7$, which contains the
protein-protein interaction (PPI) networks of organisms in the
BioGRID database$^8$. A PPI network is a probabilistic graph where
vertices represent proteins, edges represent interactions between
proteins, the labels of vertices are the COG functional annotations
of proteins$^9$ provided by the STRING database, and the existence
probabilities of edges are provided by the STRING database. We
extract 5K probabilistic graphs from the database. The probabilistic
graphs have on average 385 vertices and 612 edges.
The average existence probability of an edge is 0.383.
According to [9], the neighbor PPIs (edges) are dominated by the
strongest interactions of the neighbor PPIs. Thus, for each neighbor
edge set $ne$, we set its probabilities as $\Pr(x_{ne}) = \max_{1 \le i \le |ne|} \Pr(x_i)$,
where $x_i$ is a binary assignment to each edge in $ne$. Then,
for each $ne$, we obtain $2^{|ne|}$ probabilities. We normalize those
probabilities to construct the probability distribution of $ne$ that is
input into the algorithms. Each query set $q_i$ has 100 connected query
graphs, and the query graphs in $q_i$ are size-$i$ graphs (the edge number in
each query is $i$), which are extracted randomly from the corresponding
deterministic graphs of probabilistic graphs; the query sets are $q50$, $q100$,
$q150$, $q200$ and $q250$. In the scalability test, we randomly generate 2K,
4K, 6K, 8K and 10K data graphs.
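
The construction of the joint distribution of a neighbor edge set can be sketched as follows. The reading of $\Pr(x_i)$ used here is an assumption: for an edge with existence probability $p$, the assignment 1 is given probability $p$ and the assignment 0 is given $1 - p$. The function name `build_jpt` is hypothetical.

```python
from itertools import product


def build_jpt(edge_probs):
    """Build a normalized joint probability table for one neighbor edge set.

    Each of the 2^|ne| assignments is scored by the maximum per-edge probability
    of its bits, and the scores are then normalized to sum to one.
    """
    raw = {}
    for bits in product((0, 1), repeat=len(edge_probs)):
        raw[bits] = max(p if b else 1.0 - p for p, b in zip(edge_probs, bits))
    total = sum(raw.values())
    return {bits: score / total for bits, score in raw.items()}


# Hypothetical neighbor edge set with two edges.
jpt = build_jpt([0.8, 0.6])
for assignment in sorted(jpt):
    print(assignment, round(jpt[assignment], 3))
```

The resulting table has one entry per binary assignment and can be passed to a sampler as the distribution of that neighbor edge set.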

The experimental parameters are set as follows: the
probability threshold is 0.3-0.7, and the default value is 0.5; the
subgraph distance is 2-6, and the default value is 4; the query size
is 50-250, and the default value is 150. In feature generation, the
value of $maxL$ is 50-250, and the default value is 150; the values
of $\{\alpha, \beta, \gamma\}$ are 0.05-0.25, and the default value is 0.15.

As introduced in Section 1.2, we implement the method in [38]
to do structural pruning. This method is called Structure in the
experiments. In probabilistic pruning, the method using bounds of
subgraph similarity probability is called SSPBound, and the approach
using the best bounds is called OPT-SSPBound. To implement
SSPBound, for each $rq_i$, we randomly find two features satisfying
the conditions in the probabilistic matrix index (PMI). The method
using bounds of subgraph isomorphism probability is called SIPBound,
and the method using the tightest bound approach is called
OPT-SIPBound. In verification, the sampling algorithm is called
SMP, and the method given by Equation 21 is called Exact. Since
there are no previous works on the topic studied in this paper, we
also compare the proposed algorithms with Exact, which scans the
probabilistic graph database one graph at a time. The complete proposed
algorithm of this paper is called PMI. We report average results in the
following experiments.

In the first experiment, we demonstrate the efficiency of SMP
against Exact in the verification step. We first run the structural and
probabilistic filtering algorithms against the default dataset to create
candidate sets. The candidate sets are then verified by calculating
SSP using the proposed algorithms. Figure 9(a) reports the result,
from which we know SMP is efficient, with an average time of less than 3
seconds, while the curve of Exact decreases exponentially. The
approximation quality of SMP is measured by the precision and recall
metrics with respect to query size, shown in Figure 9(b). Precision
is the percentage of true probabilistic graphs among the output
probabilistic graphs. Recall is the percentage of returned probabilistic
graphs among all true probabilistic graphs. The experimental results
verify that SMP has a very high approximation quality, with precision
and recall both larger than 90%. We use SMP for verification
in the following experiments.

Figure 10 reports candidate sizes and pruning time of SSPBound,
OPT-SSPBound and Structure with respect to probability thresholds.
Recall that SSPBound and OPT-SSPBound are derived from the
upper and lower bounds of SIP. Here, we feed them with
OPT-SIPBound. From the results, we know that the bars of SSPBound
and OPT-SSPBound decrease as the probability threshold increases,
since larger thresholds can remove more false graphs with low
confidence. As shown in Figure 10(a), the candidate size of
OPT-SSPBound is very small (i.e., 15 on average), and is smaller than
that of SSPBound, which indicates that our derived best bounds are
tight enough to have great pruning power. As shown in Figure
10(b), OPT-SSPBound has a short pruning time (i.e., less than 1s
on average) but takes more time than SSPBound due to more subgraph
isomorphism tests during the calculation of OPT-SSPBound.
Obviously, probabilities do not have an impact on Structure, and thus
both bars of Structure remain constant.

Figure 11 shows candidate sizes and pruning time of SIPBound,
OPT-SIPBound and Structure with respect to subgraph distance
thresholds. To examine the two metrics, we feed SIPBound and
OPT-SIPBound to OPT-SSPBound. From the results, we know that
all bars increase as the subgraph distance threshold increases,
since larger thresholds lead to a larger remaining graph set
that is input into the proposed algorithms. Both OPT-SIPBound
and SIPBound have a small number of candidate graphs, but
OPT-SIPBound takes more time due to the additional time for computing
the tightest bounds. From Figures 10(a) and 11(a), we believe that
though Structure leaves a large number of candidates, the
probabilistic pruning algorithms can further remove most false graphs
with efficient runtime. This observation verifies that our algorithmic
framework (i.e., structural pruning, probabilistic pruning, verification)
is effective for processing queries on a large probabilistic graph
database.

Figure 12 examines the impact of the parameters $\{maxL, \alpha, \beta, \gamma\}$
for feature generation. Structure remains constant in the 4 results,
since the feature generation algorithm is used only for probabilistic
pruning. From Figure 12(a), we know that the larger $maxL$ is, the more
candidates SSPBound and OPT-SSPBound have. The reason is that
a large $maxL$ generates large features, which leads to loose
probabilistic bounds. From Figure 12(b), we see that all bars of
probabilistic pruning first decrease and then increase, and reach their
lowest at the values 0.1 and 0.15 of $\alpha$. As shown in Figures 12(c)
and 12(d), both bars of OPT-SIPBound decrease as the values of the
parameters increase, since either a large $\beta$ or a large $\gamma$ results in fewer
features.

Figure 13 reports the total query processing time with respect to
different graph database sizes. PMI denotes the complete algorithm,
that is, a combination of Structure, OPT-SSPBound (fed with
OPT-SIPBound) and SMP. From the result, we know PMI has quite
efficient runtime and avoids the huge cost of computing SSP
(#P-complete). PMI can process queries within 10 seconds on average.
In contrast, the runtime of Exact grows exponentially, and has gone
beyond 1000 seconds at the database size of 6K. The result of this
experiment validates the designs of this paper.

Figure 14 examines the quality of query answers based on the
probability correlated and independent models. The query returns
probabilistic graphs if the probabilistic graphs and the query (subgraph)
belong to the same organism. We say the query and a probabilistic
graph belong to the same organism if the subgraph similarity
probability is not less than the threshold. In fact, the STRING
database gives the real organisms of the probabilistic graphs. Thus
we can use precision and recall to measure the query quality.
Precision is the percentage of real probabilistic graphs among the
returned probabilistic graphs. Recall is the percentage of returned
real probabilistic graphs among all real probabilistic graphs. To
determine query answers for the probability independent model, we
multiply the probabilities of edges in each neighbor edge set to obtain
joint probability tables (JPTs). Based on the JPTs, we use PMI
to determine query answers for the probability independent model.
Each time, we randomly generate 100 queries and report average
results. In the examination, COR and IND denote the probability
correlated and probability independent models respectively. In
the figure, precision and recall go down as the probability threshold
gets larger, since large thresholds make it more difficult for the query
and graphs to be categorized into the same organism. We also know that the
probability correlated model has much higher precision and recall
than the probability independent model. The probability correlated
model has average precision and recall both larger than 85%, while
the probability independent model has values smaller than 60% at
thresholds larger than 0.6. The result indicates that our proposed
model captures biological features more accurately than the probability
independent model.

Figure 9: Scalability to query size.

7. RELATED WORK

In this paper, we study similarity search over uncertain graphs,
which is related to uncertain and graph data management. Readers
who are interested in general uncertain and graph data management
may refer to [3] and [4] respectively.

Figure 13: Total query processing time.

Figure 14: Query quality comparison (COR vs. IND).



The topic most related to our work is similarity search in
deterministic graphs. Yan et al. [38] proposed to process subgraph
similarity queries based on frequent graph features. They used a
filtering-verification paradigm to process queries. He et al. [15]
employed an R-tree like index structure, organizing graphs hierarchically
in a tree, to support $k$-NN search to the query graph. Jiang
et al. [19] encoded graphs into strings and converted graph similarity
search into string matching. Williams et al. [35] aimed to find
graphs with the minimum number of mismatches of vertex and
edge labels bounded by a given threshold. Zeng et al. [41] proposed
tight bounds of graph edit distance to filter out false graphs in
similarity search, based on which Wang et al. [34] developed an indexing
strategy to speed up queries. Shang et al. [30] studied super-graph
similarity search, and proposed top-down and bottom-up index
construction strategies to optimize the performance of query processing.
Recently, Sun et al. [32] proposed a subgraph matching algorithm
on distributed in-memory graphs without using a structured index.

Another related topic is querying uncertain graphs. Potamias et
al. [27] studied $k$-nearest neighbor queries ($k$-NN) over uncertain
graphs, i.e., computing the $k$ closest nodes to a query node. They
proposed sampling algorithms to answer the #P-complete $k$-NN
queries. Zou et al. [42, 43] studied frequent subgraph mining on
uncertain graph data under the probability and expectation semantics
respectively. Yuan et al. [40] proposed a graph feature-based framework
to process subgraph queries on uncertain graphs. In other work,
Yuan et al. [39] and Jin et al. [21] studied shortest path queries and
distance-constraint reachability queries in a single uncertain graph.
The above works define uncertain graph models with independent
edge distributions and do not consider edge correlations.

8. CONCLUSION 

This is the first work to answer the subgraph similarity query
on large probabilistic graphs with correlation among edge probability
distributions. Though it is an NP-hard problem, we employ
the filter-and-verify methodology to answer the query efficiently.
During the filtering phase, we propose a probabilistic matrix index (PMI)
with tight upper and lower bounds of subgraph isomorphism
probability. Based on PMI, we derive upper and lower bounds of
subgraph similarity probability, and we compute the best bounds by
developing deterministic and randomized optimization algorithms.
We also propose selective strategies for picking powerful subgraph
features. Therefore we are able to filter out a large number of
probabilistic graphs without calculating the subgraph similarity
probabilities. During verification, we use Monte Carlo theory to quickly
validate the final answers with high quality. Finally, we confirm our
designs through an extensive experimental study.